Research Question:
What information can be drawn from performing basic k-means
clustering on the expeditions dataset within the Himalayan
Climbing Expeditions database? In particular, how does the highpoint
elevation relate to member deaths?
Introduction:
The expeditions dataset was analyzed to uncover
relationships that are not apparent on basic inspection. K-means
clustering was performed on this dataset to group expeditions around
the nearest cluster mean. With 10,364 rows and 16 columns, the dataset
contains detailed records of expeditions to different peaks in the
Himalayan Mountains. It provides insights into injuries, deaths,
and seasonal treks from 1905 to 2019.
While k-means clustering was performed on all numeric columns of the
dataset, the following numerical and categorical variables are of main
interest in this research document: highpoint_metres,
members, member_deaths, and
season. Clustering this dataset groups the expeditions so that
the highpoint elevation can be compared against the percentage of
member deaths.
Details on the specific columns are shown below:
highpoint_metres: elevation highpoint of the
expedition
members: the number of foreigners listed on the
expedition permit
member_deaths: number of expedition members who
died
season: season of expedition (spring, summer, etc.)
The dataset analyzed within this research document,
expeditions, is shown below:
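library(tidyverse) # dplyr, ggplot2, purrr, tibble, readr
library(broom) # augment() and tidy() for k-means fits
expeditions <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/34eacc4ccf5878769351a5a21d5992eb28f383b6/data/2020/2020-09-22/expeditions.csv')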
head(expeditions)
## # A tibble: 6 × 16
## expedi…¹ peak_id peak_…² year season basecamp…³ highpoin…⁴ terminat…⁵ termi…⁶
## <chr> <chr> <chr> <dbl> <chr> <date> <date> <date> <chr>
## 1 ANN2601… ANN2 Annapu… 1960 Spring 1960-03-15 1960-05-17 NA Succes…
## 2 ANN2693… ANN2 Annapu… 1969 Autumn 1969-09-25 1969-10-22 1969-10-26 Succes…
## 3 ANN2731… ANN2 Annapu… 1973 Spring 1973-03-16 1973-05-06 NA Succes…
## 4 ANN2783… ANN2 Annapu… 1978 Autumn 1978-09-08 1978-10-02 1978-10-05 Bad we…
## 5 ANN2793… ANN2 Annapu… 1979 Autumn NA 1979-10-18 1979-10-20 Bad we…
## 6 ANN2801… ANN2 Annapu… 1980 Spring 1980-03-25 1980-04-24 1980-05-01 Accide…
## # … with 7 more variables: highpoint_metres <dbl>, members <dbl>,
## # member_deaths <dbl>, hired_staff <dbl>, hired_staff_deaths <dbl>,
## # oxygen_used <lgl>, trekking_agency <chr>, and abbreviated variable names
## # ¹expedition_id, ²peak_name, ³basecamp_date, ⁴highpoint_date,
## # ⁵termination_date, ⁶termination_reason
The database that contains the expeditions dataset
also includes a members dataset and a
peaks dataset. This research document draws on
information from the expeditions dataset only.
More information regarding this dataset and parent database can be found at the following link: https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-09-22
Pre-processing of the dataset is provided in the code snippet below:
expeditions <- expeditions %>%
filter(season != "Unknown") %>% #remove unknown seasons
dplyr::select(-c("basecamp_date","highpoint_date","termination_date","termination_reason","trekking_agency"))
head(expeditions)
## # A tibble: 6 × 11
## expedit…¹ peak_id peak_…² year season highp…³ members membe…⁴ hired…⁵ hired…⁶
## <chr> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ANN260101 ANN2 Annapu… 1960 Spring 7937 10 0 9 0
## 2 ANN269301 ANN2 Annapu… 1969 Autumn 7937 10 0 0 0
## 3 ANN273101 ANN2 Annapu… 1973 Spring 7937 6 0 8 0
## 4 ANN278301 ANN2 Annapu… 1978 Autumn 7000 2 0 0 0
## 5 ANN279301 ANN2 Annapu… 1979 Autumn 7160 3 0 0 0
## 6 ANN280101 ANN2 Annapu… 1980 Spring 7000 6 1 2 0
## # … with 1 more variable: oxygen_used <lgl>, and abbreviated variable names
## # ¹expedition_id, ²peak_name, ³highpoint_metres, ⁴member_deaths,
## # ⁵hired_staff, ⁶hired_staff_deaths
Because this analysis relies on the season column,
expeditions whose season is not known must be removed.
Only two expeditions in the entire dataset were removed for having an
“Unknown” season. The date columns, along with two text columns, were
also removed so that as much information as possible is preserved when
rows with NA/NaN values are later omitted. The removed columns are:
basecamp_date, highpoint_date,
termination_date, termination_reason, and
trekking_agency.
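The effect of dropping sparse columns before omitting NA rows can be sketched on a small made-up table. The data frame `toy` and its values are assumptions for illustration only and are not taken from the expeditions dataset; the sketch uses base R rather than the dplyr calls above.

```r
# Toy stand-in for the expeditions table (all values are made up).
toy <- data.frame(
  highpoint_metres = c(7937, 7000, 7160, 7000),
  members          = c(10, 2, 3, 6),
  basecamp_date    = as.Date(c("1960-03-15", "1978-09-08", NA, "1980-03-25")),
  trekking_agency  = c(NA, NA, NA, "Agency A")
)

# Count missing values per column to see which columns would cost the most rows.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Rows that survive na.omit() with and without the sparse columns.
rows_all_columns <- nrow(na.omit(toy))
rows_after_drop  <- nrow(na.omit(toy[, c("highpoint_metres", "members")]))
print(c(all_columns = rows_all_columns, after_drop = rows_after_drop))
```

In this toy case only one row is complete across all four columns, while all four rows survive once the sparse columns are dropped.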
Approach:
K-means clustering is an unsupervised algorithm: it does not make use of labelled data or a training dataset. It groups data points so that similarity is maximized within each cluster and minimized between different clusters.
Below are the basic steps of k-means clustering:
1. Start with k randomly chosen means
2. Assign each data point to its nearest mean (shown by color in the plots)
3. Move each mean to the centroid of its group of points
4. Repeat from step 2 until convergence
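The steps above can be sketched in a few lines of base R. This is a minimal illustration only: `simple_kmeans` is a hypothetical helper written for this example, the data are made up, and the analysis in this document uses the built-in `kmeans()` function instead.

```r
# Minimal sketch of the k-means loop on a numeric matrix.
# simple_kmeans is a hypothetical helper, not the function used in the analysis.
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Step 1: start with k randomly chosen rows as the initial means
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # Step 2: assign each point to its nearest mean (squared Euclidean distance)
    dists <- sapply(seq_len(k), function(j) {
      rowSums((x - matrix(centers[j, ], nrow(x), ncol(x), byrow = TRUE))^2)
    })
    new_cluster <- max.col(-dists, ties.method = "first")
    # Step 4: stop once the assignments no longer change
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # Step 3: move each mean to the centroid of its assigned points
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
  }
  list(cluster = cluster, centers = centers)
}

set.seed(1)
# Two well-separated toy groups of 20 points each (made-up data)
pts <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 10), ncol = 2)
)
fit <- simple_kmeans(pts, k = 2)
print(table(fit$cluster))
```

Because the result depends on the random starting means, the built-in `kmeans()` offers the `nstart` argument to repeat the procedure from several starts and keep the best fit.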
For this analysis, the percentage of member deaths was
calculated using the following formula:
member_deaths / members * 100 . Expressing deaths as a percentage of
team size allows highpoint_metres to be related to
member_deaths in a way that accounts for how many
members were present. Two k-means cluster analyses were performed to
show the difference between including and excluding expeditions with
0% member deaths.
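As a small worked example of this formula, the sketch below computes the percentage on three made-up rows and then keeps only the rows with at least one death, mirroring the two analyses. The data frame `toy` and its values are assumptions for illustration; the guard against zero members is an added precaution, since dividing by zero in R gives NaN or Inf.

```r
# Toy rows shaped like the expeditions table (all values are made up).
toy <- data.frame(
  highpoint_metres = c(7937, 7000, 8850),
  members          = c(10, 6, 4),
  member_deaths    = c(0, 1, 2)
)

# Percentage of members who died; NA where no members are listed
toy$death_pct <- ifelse(
  toy$members > 0,
  toy$member_deaths / toy$members * 100,
  NA_real_
)

# Subset used when expeditions with zero deaths are excluded
with_deaths <- toy[toy$member_deaths != 0, ]
print(toy)
print(nrow(with_deaths))
```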
Because the k-means algorithm relies on means and distances, it is not applicable to categorical data: categorical variables are discrete and do not have any natural origin. The analysis below is therefore done on numerical data only.
The k-means clustering is computed below:
km_fit <- na.omit(expeditions) %>%
dplyr::select(where(is.numeric)) %>% #selecting only numeric data
kmeans(
centers = 5, # number of cluster centers
nstart = 10 # number of independent restarts of the algorithm
)
Below is a summary of the k-means clustering analysis performed by the above snippet of code:
summary(km_fit)
## Length Class Mode
## cluster 9948 -none- numeric
## centers 30 -none- numeric
## totss 1 -none- numeric
## withinss 5 -none- numeric
## tot.withinss 1 -none- numeric
## betweenss 1 -none- numeric
## size 5 -none- numeric
## iter 1 -none- numeric
## ifault 1 -none- numeric
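Since `summary()` only lists the length of each component, it can help to look at the components directly. In the output above, a `centers` length of 30 is consistent with 5 clusters across the 6 numeric columns that remain after pre-processing. The sketch below uses made-up data rather than the expeditions table, so the object names and values are illustrative assumptions.

```r
set.seed(42)
# Made-up numeric data standing in for the numeric expedition columns
toy <- data.frame(
  a = c(rnorm(25, mean = 0), rnorm(25, mean = 8)),
  b = c(rnorm(25, mean = 0), rnorm(25, mean = 8))
)
fit <- kmeans(toy, centers = 2, nstart = 10)

print(fit$size)          # number of rows assigned to each cluster
print(fit$centers)       # one row per cluster, one column per variable
print(fit$tot.withinss)  # total within-cluster sum of squares

# Share of the total sum of squares accounted for by the cluster split
explained <- fit$betweenss / fit$totss
print(explained)
```

The total sum of squares splits into the within-cluster and between-cluster parts, so `fit$totss` equals `fit$tot.withinss + fit$betweenss` up to rounding.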
Analysis:
The below code and graph show k-means clustering for the
expeditions dataset with 5 clusters:
# plot
km_fit <- na.omit(expeditions) %>% #omitting NA/NaNs
dplyr::select(where(is.numeric)) %>% #selecting only numeric data
kmeans(
centers = 5, # number of cluster centers
nstart = 10 # number of independent restarts of the algorithm
)
km_fit %>%
# combine with original data
augment(na.omit(expeditions)) %>%
ggplot(
aes(highpoint_metres, member_deaths/members*100), #percentage of member deaths calculated
) +
geom_point(
aes(color = .cluster, #color by cluster
shape = season,
size = 2,
alpha=0.9)
) +
geom_point( #points of the clusters themselves
data = tidy(km_fit),
aes(fill = cluster),
shape = 21, color = "black", size = 4
) +
ggtitle("Clusters") +
labs( #adding labels
x = "Highpoint Metres",
y = "Percentage of Member Deaths (%)",
fill = "Cluster",
shape = "Season",
subtitle = "Figure 1",
caption = "*Expeditions with zero deaths were included within this analysis",
) +
guides(
color = "none"
) +
scale_fill_manual(
values = c("1"='#1b9e77',"2"='#d95f02',"3"='#7570b3',"4"='#e7298a',"5"='#66a61e') #custom palette
) +
scale_color_manual(
values = c("1"='#1b9e77',"2"='#d95f02',"3"='#7570b3',"4"='#e7298a',"5"='#66a61e') #custom palette
) +
theme_bw( #adding a theme for visualization
) +
theme( #aesthetics
legend.position = "top",
axis.line = element_line(colour = "black"),
panel.border = element_blank(),
panel.background = element_blank(),
legend.text=element_text(size=7),
legend.spacing.y = unit(0.0, 'cm'),
) +
scale_alpha(
guide = 'none'
) +
scale_size(
guide = 'none'
)
There are 565 expeditions in which at least one member died. The graph below shows k-means clustering done on only those expeditions, again with 5 clusters:
# plot
expeditions <- expeditions %>% filter(member_deaths != 0)
km_fit <- na.omit(expeditions) %>%
dplyr::select(where(is.numeric)) %>%
kmeans(
centers = 5, # number of cluster centers
nstart = 10 # number of independent restarts of the algorithm
)
km_fit %>%
# combine with original data
augment(na.omit(expeditions)) %>%
ggplot(
aes(highpoint_metres, member_deaths/members*100), #percentage of member deaths calculated
) +
geom_point(
aes(color = .cluster, #color by cluster
shape = season,
size = 2,
alpha=0.9)
) +
geom_point( #points at center of cluster
data = tidy(km_fit),
aes(fill = cluster),
shape = 21, color = "black", size = 4
) +
ggtitle("Clusters") +
labs( #adding labels
x = "Highpoint Metres",
y = "Percentage of Member Deaths (%)",
fill = "Cluster",
shape = "Season",
subtitle = "Figure 2",
caption = "*Expeditions with zero deaths were excluded from this analysis",
) +
guides(
color = "none"
) +
scale_fill_manual(
values = c("1"='#1b9e77',"2"='#d95f02',"3"='#7570b3',"4"='#e7298a',"5"='#66a61e') #custom palette
) +
scale_color_manual(
values = c("1"='#1b9e77',"2"='#d95f02',"3"='#7570b3',"4"='#e7298a',"5"='#66a61e') #custom palette
) +
theme_bw( #adding a theme for visualization
) +
theme( #aesthetics
legend.position = "top",
axis.line = element_line(colour = "black"),
panel.border = element_blank(),
panel.background = element_blank(),
legend.text=element_text(size=7),
legend.spacing.y = unit(0.0, 'cm'),
) +
scale_alpha(
guide = 'none'
) +
scale_size(
guide = 'none'
)
The below code and figure show the scree plot of the k-means clusters
for the expeditions dataset where a death occurred:
# function to calculate within sum squares
calc_withinss <- function(data, centers) {
km_fit <- dplyr::select(data, where(is.numeric)) %>%
kmeans(centers = centers, nstart = 10)
km_fit$tot.withinss
}
tibble(centers = 1:15) %>%
mutate(
within_sum_squares = map_dbl(
centers, ~calc_withinss(na.omit(expeditions), .x) # expeditions with deaths, NA rows omitted
)
) %>%
ggplot() +
aes(centers, within_sum_squares) +
geom_point(color = "#d95f02",size=3) +
geom_line(color = "#d95f02", size=1.3) +
ggtitle("Sum of Squares Scree Plot") +
labs( #adding labels
x = "Number of Clusters",
y = "Within Sum of Squares",
subtitle = "Figure 3",
caption = "*Expeditions with zero deaths were excluded from this analysis",
) +
scale_color_manual(
values = c('#1b9e77','#d95f02','#7570b3','#e7298a','#66a61e') #custom palette
) +
theme_bw( #adding a theme for visualization
) +
theme( #aesthetics
legend.position = "top",
axis.line = element_line(colour = "black"),
panel.border = element_blank(),
panel.background = element_blank(),
legend.text=element_text(size=7),
legend.spacing.y = unit(0.0, 'cm'),
)
Discussion:
K-means clustering is a widely used technique in machine learning and data science because it helps reveal structure in data. It groups data points so as to maximize their similarity within clusters. The clusters generated from this analysis are used to group the expeditions.
The analysis of the expeditions dataset shows
several relationships. Figure 1 presents the k-means clustering
performed on the full expeditions dataset. This basic analysis
shows clusters that are most notably divided by the highpoint elevation
of the expedition. Because the large number of expeditions with 0%
recorded member deaths makes it difficult to draw conclusions from this
analysis, an additional analysis was performed omitting those expeditions.
In Figure 2, the clusters generated by the k-means clustering
analysis show that, generally, the lower the highpoint elevation of the
expedition, the greater the percentage of deaths. This figure also gives
better insight into the seasons in which the highest percentages of
deaths occur. Other variables factor into which season has the highest
percentage of deaths, such as highpoint_metres, shown on
the x-axis. The clusters in this graph vary slightly from the clusters
in Figure 1, which also includes the expeditions with 0% member deaths.
Finally, Figure 3, titled Sum of Squares Scree Plot, shows the total within-cluster sum of squares for different numbers of clusters. As the number of clusters increases, the variance (within-group sum of squares) decreases. The elbow at three clusters represents a balance between minimizing the number of clusters and minimizing the variance within each cluster, achieving parsimony within these parameters.
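One way to make the reading of an elbow less dependent on the eye is to tabulate how much the within sum of squares drops with each added cluster. The sketch below does this on made-up data with three well-separated groups; the data, the range of k, and the object names are assumptions for illustration and do not reproduce Figure 3.

```r
set.seed(7)
# Made-up data with three well-separated groups of 30 points each
toy <- rbind(
  matrix(rnorm(60, mean = 0), ncol = 2),
  matrix(rnorm(60, mean = 6), ncol = 2),
  matrix(rnorm(60, mean = 12), ncol = 2)
)

# Total within-cluster sum of squares for k = 1 to 6
ks <- 1:6
wss <- sapply(ks, function(k) {
  kmeans(toy, centers = k, nstart = 10)$tot.withinss
})

# Reduction in within sum of squares gained by each additional cluster
drops <- -diff(wss)
print(data.frame(k = ks[-1], drop = drops))
```

The elbow sits at the last value of k that still gives a large drop; beyond it, additional clusters reduce the within sum of squares only slightly.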